Unsupervised model-free representation learning 
Abstract

Numerous control and learning problems face the situation where sequences of high-dimensional, highly dependent data are available, but little or no feedback is provided to the learner. To address this issue, we formulate the following problem. Given a series of observations $X_0, \dots, X_n$ coming from a large (high-dimensional) space $\mathcal{X}$, find a representation function $f$ mapping $\mathcal{X}$ to a finite space $\mathcal{Y}$ such that the series $f(X_0), \dots, f(X_n)$ preserves as much information as possible about the original time-series dependence in $X_0, \dots, X_n$. We show that, for stationary time series, the function $f$ can be selected as the one maximizing the time-series information $h_0(f(X)) - h_\infty(f(X))$, where $h_0(f(X))$ is the Shannon entropy of $f(X_0)$ and $h_\infty(f(X))$ is the entropy rate of the time series $f(X_0), \dots, f(X_n), \dots$. Implications for the problem of optimal control are presented.



1 Introduction 

In many learning and control problems one has to deal with the situation where the input data is high-dimensional and abundant, but the feedback for the learning algorithm is scarce or absent. In such situations, finding the right representation of the data can be the key to solving the problem. The focus of this work is on problems in which all or a significant part of the relevant information is in the time-series dependence of the process. This is the case in many applications, starting with speech or hand-written text recognition and, more generally, including control and learning problems in which the input is a stream of sensor data of an agent interacting with its environment.



A more formal exposition of the problem follows. First, assume that we are given a stationary sequence $X_0, \dots, X_n, \dots$ where the $X_i$ belong to a large (continuous, high-dimensional) space $\mathcal{X}$. For the moment, assume that the problem is non-interactive (the control part is introduced later). We are looking for a compact representation $f(X_0), \dots, f(X_n), \dots$ where the $f(X_i)$ belong to a small (for example, finite) space $\mathcal{Y}$.

Let us first consider the following "ideal" situation. There exists a function $f : \mathcal{X} \to \mathcal{Y}$ such that each random variable $X_i$ is independent of the rest of the sample $X_0, \dots, X_{i-1}, X_{i+1}, \dots, X_n$ given $f(X_i)$ (for each $i, n \in \mathbb{N}$). That is, all the time-series dependence is in the sequence $f(X_0), \dots, f(X_n)$, and, given this sequence, the original sequence $X_0, \dots, X_n, \dots$ can be considered as noise, in the sense that the $X_i$ are conditionally independent. In this case we say that $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(f(X_i))_{i \in \mathbb{N}}$. We can show that in this "ideal" situation the function $f$ maximizes the following information criterion



$$I_\infty(f) := h_0(f(X)) - h_\infty(f(X)) \qquad (1)$$



where $h_0(f(X))$ is the Shannon entropy of the first element $f(X_0)$ and $h_\infty$ is the entropy rate of the (stationary) time series $f(X_0), \dots, f(X_n), \dots$. This means that for any other function $g : \mathcal{X} \to \mathcal{Y}$ we have $I_\infty(f) \ge I_\infty(g)$, with equality if and only if $(X_i)_{i \in \mathbb{N}}$ are also conditionally independent given $(g(X_i))_{i \in \mathbb{N}}$.

This allows us to pass to the non-ideal situation, in which there is no function $f$ that satisfies the conditional independence criterion. Given a set of functions mapping $\mathcal{X}$ to $\mathcal{Y}$, the function that preserves the most of the time-series dependence can be defined as the one that maximizes (1). Such a function $f$ can be said to preserve the most of the time-series dependence of the original time series $(X_i)_{i \in \mathbb{N}}$ (as opposed to the ideal case, in which such a function $f$ preserves all of the time-series dependence).

For a given function $f$, the quantity (1) can be estimated empirically. Moreover, we can show that under certain conditions it is possible to estimate (1) uniformly over a set $\mathcal{F}$ of functions $f : \mathcal{X} \to \mathcal{Y}$. Importantly, the estimation can be carried out without estimating the distribution of the original time series $(X_i)_{i \in \mathbb{N}}$.

Of particular interest (especially to control problems) is the case where the time series $(X_i)_{i \in \mathbb{N}}$ form a Markov process. In this case, in the "ideal" situation (when $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(f(X_i))_{i \in \mathbb{N}}$) one can show that the process $(f(X_i))_{i \in \mathbb{N}}$ is also Markov, and $I_\infty(f) = I_1(f) := h(f(X_0)) - h(f(X_1) \mid f(X_0))$. In general, we show that in the Markov case, to select a function that maximizes $I_\infty(f)$ it is enough to maximize $I_1(f)$.

Next, assume that at each time step $i$ we are allowed to take an action $A_i$, and the next observation $X_{i+1}$ depends not only on $X_0, \dots, X_i$ but also on the actions $A_0, \dots, A_i$. Thus, we are considering the control problem, and the time series $(X_i)_{i \in \mathbb{N}}$ do not have to be stationary any more. In this situation, the time-series information $I_\infty(f)$ becomes dependent on the policy of the learner (that is, on the way the actions are chosen). However, we can show that in the Markov case, under some mild connectivity conditions, to select the function $f$ that maximizes $I_\infty(f)$ it is enough to consider just one policy that takes all actions with non-zero probability. This means that one can find the representation function $f$ while executing a random policy, without any feedback from the environment (i.e., without rewards). One can then use this representation to solve the target control problem more easily.

Prior work. Learning representations, feature learning, model learning, as well as model and feature selection, are different variants and different names of the same general problem: making the data more amenable to learning. From the vast literature available on these problems we only mention a few works that are related to the approach in this work. First, note that if in our "ideal" (conditional independence) case we further assume that $(X_i)$ form a Markov chain, then we get a special case of Hidden Markov models (HMM) [10]. Indeed, as mentioned above, in this case the $f(X_i)$ form a Markov chain, and thus can be considered hidden states; the dependence between $f(X_i)$ and $X_i$ is deterministic, as opposed to randomized in HMM, so we get a special case. Thus, the general case (non-ideal situation, $X_i$ are not necessarily Markov) can be considered a generalization of HMM. From a different perspective, if the $X_i$ are independent and identically distributed and, instead of the time-series dependence (which is absent in this case), we want to preserve as much as possible of the information about another sequence of variables (labels) $Y_1, \dots, Y_n$, then one can arrive at the information bottleneck method [18]. The information bottleneck method can, in turn, be seen as a generalization of the rate-distortion theory of Shannon [16]. Applied to dynamical systems, the information bottleneck method can be formulated [1] as follows: minimize $I(\text{past}; \text{representation}) - \beta I(\text{representation}; \text{future})$, where $\beta$ is a parameter. A related idea is that of causal states [13]: two histories belong to the same causal state iff they give the same conditional distribution over futures.

What distinguishes the approach of this work from those described is that we never have to consider the probability distribution of the input time series $X_i$ directly, but only through the distribution of the representations $f(X_i)$. Thus, modelling or estimating $X_i$ is not required; this is particularly important for empirical estimates.

For the control problem, to relate the proposed approach to others, first observe that in the case of an MDP, in the "ideal" scenario, that is, when there exists a function $f : \mathcal{X} \to \mathcal{Y}$ such that $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(f(X_i))_{i \in \mathbb{N}}$, for any states $x, x' \in \mathcal{X}$ for which $f(x) = f(x')$ all the transition probabilities are the same. In other words, states $x, x' \in \mathcal{X}$ for which $f(x) = f(x')$ are equivalent in a very strong sense, and the function $f$ can be viewed as state aggregation. Generalizations of this equivalence and aggregation (in the presence of rewards or costs) are studied in the bisimulation and homomorphism literature [3], [2], [17], [11]. The main difference of our approach (besides the absence of rewards) is in the treatment of approximate (non-ideal) cases and in the way we propose to find the representation (aggregation) functions. In bisimulation this is approached via a metric on the state space defined using a distance between the transition (and reward) probability distributions, which then has to be estimated [2], [17]. In our approach, all that has to be estimated concerns the representations $f(X)$, rather than the observations (states) $X$ themselves.

The problem of finding a (concise) representation of the input space such that the resulting process on representations is Markovian has also been studied in [8, 9]. Another related approach is finding representations based on compression [6]. Note that all these approaches consider the supervised version of the problem.

It should also be noted that the conditional independence property has been previously studied in a different context (classification) in [14]. The latter work shows that if the objects $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given the labels $(Y_i)_{i \in \mathbb{N}}$ then, effectively, one can use classification methods developed to work in the case of i.i.d. object-label pairs. Combined with the results of this work, this means that in the ideal (conditional independence) case one can decompose a learning problem into i.i.d. classification and learning the time-series dependence. It is also worth noting that the quantity (1) has been studied in a different context: [12] uses it to construct a statistical test for the hypothesis that a time series consists of independent and identically distributed variables. Furthermore, one can show (see below) that for stationary time series $I_\infty(f)$ equals the mutual information $I(f(X_0); f(X_{-1}), f(X_{-2}), \dots)$ between the present and the past of the series; this characteristic of time series has been extensively studied [4].

Organization. The rest of the paper is organized as follows. Section 2 introduces some notation and definitions. Section 3 introduces the model and gives the main results concerning representation functions for stationary time series. Section 4 considers the special case of (stationary) Markov chains; Section 5 presents results on uniform empirical approximation of time-series information. Finally, Section 6 extends the model and results to the control problem. Some longer proofs are deferred to Section 7.



2 Preliminaries 

Let $(\mathcal{X}, \mathcal{F}_{\mathcal{X}})$ and $(\mathcal{Y}, \mathcal{F}_{\mathcal{Y}})$ be measurable spaces. $\mathcal{X}$ is assumed to be large (e.g., a high-dimensional Euclidean space) and $\mathcal{Y}$ small. For simplicity of exposition, we assume that $\mathcal{Y}$ is finite; however, the results can be extended to infinite (and continuous) spaces $\mathcal{Y}$ as well.

Time-series (or process) distributions are probability measures on the space $(\mathcal{X}^{\mathbb{N}}, \mathcal{F}_{\mathbb{N}})$ of one-way infinite sequences (where $\mathcal{F}_{\mathbb{N}}$ is the Borel sigma-algebra of $\mathcal{X}^{\mathbb{N}}$). We use the abbreviation $X_{0..k}$ for $X_0, \dots, X_k$. A distribution $\rho$ is stationary if $\rho(X_{1..k} \in A) = \rho(X_{n+1..n+k} \in A)$ for all $A \in \mathcal{F}_{\mathcal{X}^k}$, $k, n \in \mathbb{N}$ (with $\mathcal{F}_{\mathcal{X}^k}$ being the sigma-algebra of $\mathcal{X}^k$).

A stationary distribution on $\mathcal{X}^{\mathbb{N}}$ can be uniquely extended to a distribution on $\mathcal{X}^{\mathbb{Z}}$ (that is, to a time series $\dots, X_{-1}, X_0, X_1, \dots$); we will assume such an extension whenever necessary.

For a random variable $Z$ denote by $h(Z)$ its entropy. Define $h_0(f)$ as the entropy of $f(X_0)$,

$$h_0(f) := h(f(X_0)), \qquad (2)$$

and $h_k(f)$ as the $k$-order entropy of $f(X)$,

$$h_k(f) := \mathbb{E}_{X_0, \dots, X_{k-1}}\, h(f(X_k) \mid f(X_0), \dots, f(X_{k-1})). \qquad (3)$$

For stationary time series $(f(X_i))_{i \in \mathbb{N}}$ the entropy rate is defined as

$$h_\infty(f) := \lim_{k \to \infty} h_k(f).$$

When we speak about conditional distributions, the equality of distributions should be understood in the "almost sure" sense.



3 Time-series information for stationary distributions

This section describes the main results concerning representation functions for stationary time series. We first introduce the "ideal" situation in which $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(f(X_i))_{i \in \mathbb{N}}$ for some function $f : \mathcal{X} \to \mathcal{Y}$, and define time-series information. We then show that under this condition the function $f$ maximizes time-series information.

Definition 1 (conditional independence given labels). We say that $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(f(X_i))_{i \in \mathbb{N}}$ if for all $n, k$ and all $i_1, \dots, i_k \ne n$, $X_n$ is independent of $X_{i_1}, \dots, X_{i_k}$ given $f(X_n)$:

$$P(X_n \mid f(X_n), X_{i_1}, \dots, X_{i_k}) = P(X_n \mid f(X_n)) \quad \text{a.s.} \qquad (4)$$



Definition 2. The time-series information of a series $f(X_0), \dots, f(X_n), \dots$ is defined as

$$I_\infty(f) := h_0(f) - h_\infty(f). \qquad (5)$$



We can also define $k$-order time-series information as follows:

$$I_k(f) := h_0(f) - h_k(f) = I(f(X_k); f(X_0), \dots, f(X_{k-1})).$$
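The definition above can be turned into a simple plug-in estimate. The following is a minimal sketch, not the estimator analysed in this paper: it assumes the labels $f(X_0), \dots, f(X_n)$ are already available as a Python list, estimates block entropies from empirical frequencies, and uses the identity that the $k$-order entropy is the difference between the entropies of blocks of length $k+1$ and $k$. The names `block_entropy` and `k_order_information` are hypothetical.

```python
from collections import Counter
from math import log2


def block_entropy(labels, m):
    """Plug-in Shannon entropy (in bits) of length-m blocks of the sequence."""
    if m == 0:
        return 0.0
    blocks = [tuple(labels[i:i + m]) for i in range(len(labels) - m + 1)]
    total = len(blocks)
    return -sum(c / total * log2(c / total) for c in Counter(blocks).values())


def k_order_information(labels, k):
    """Estimate I_k = h_0 - h_k, with h_k = H(blocks of k+1) - H(blocks of k)."""
    h0 = block_entropy(labels, 1)
    hk = block_entropy(labels, k + 1) - block_entropy(labels, k)
    return h0 - hk


# A perfectly predictable alternating sequence and a constant one.
alternating = [i % 2 for i in range(1001)]
constant = [0] * 1001

print(k_order_information(alternating, 1))  # close to 1 bit
print(k_order_information(constant, 1))     # no information to preserve
```

The alternating sequence carries about one bit of time-series information (each symbol is determined by its predecessor), while the constant sequence carries none because its entropy is already zero.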

The following lemma helps to understand the nature of the quantities $I_\infty(f)$ and $I_k(f)$.

Lemma 1. If the time series $(X_i)_{i \in \mathbb{Z}}$ is stationary then

$$I_\infty(f) = I(f(X_0); f(X_{-1}), f(X_{-2}), \dots). \qquad (6)$$

Proof. Denote $Y_i := f(X_i)$. We have

$$I_\infty(f) = \lim_{k \to \infty} \big( h(Y_0) - h(Y_0 \mid Y_{-1}, \dots, Y_{-k}) \big) = \lim_{k \to \infty} I(Y_0; Y_{-1}, \dots, Y_{-k}) = I(Y_0; Y_{-1}, Y_{-2}, \dots),$$

where the first equality follows from the stationarity of $(X_i)_{i \in \mathbb{Z}}$ and for the last see, e.g., [4, Lemma 5.6.1]. □

The following is the main result concerning representations of stationary time series. Its proof is given in Section 7.

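Theorem 1. Let $f : \mathcal{X} \to \mathcal{Y}$ be such that $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(f(X_i))_{i \in \mathbb{N}}$. Then for any $g : \mathcal{X} \to \mathcal{Y}$ we have $I_\infty(f) \ge I_\infty(g)$, with equality if and only if $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(g(X_i))_{i \in \mathbb{N}}$.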

Thus, given a set $\mathcal{F}$ of representation functions $f : \mathcal{X} \to \mathcal{Y}$, the function that is "closest" to satisfying the conditional independence property of Definition 1 can be defined as the one that maximizes (5). If the set $\mathcal{F}$ is finite and the time series $(X_i)_{i \in \mathbb{N}}$ is stationary, then it is possible to find the function that maximizes (5) given a large enough sample of the time series, without knowing anything about its distribution. Indeed, it suffices to have a consistent estimator for $h_0(f)$ and a consistent estimator for the entropy rate $h_\infty(f)$. The former can be estimated using empirical plug-in estimates, and the latter using, for example, data compressors; see, for example, [12], [13].
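The selection procedure for a finite set of representation functions can be sketched as follows. This is an illustration under stated assumptions rather than the estimators of [12], [13]: the entropy rate is approximated crudely by the size of a `zlib`-compressed copy of the label sequence (labels are assumed to be integers in 0..255), and the toy data, the candidate names `block` and `noise`, and all function names are hypothetical.

```python
import random
import zlib
from collections import Counter
from math import log2


def plugin_entropy(labels):
    """Empirical plug-in estimate of h_0 in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def compression_entropy_rate(labels):
    """Crude entropy-rate estimate: compressed bits per symbol (zlib)."""
    data = bytes(labels)  # assumes integer labels in 0..255
    return 8 * len(zlib.compress(data, 9)) / len(labels)


def time_series_information(labels):
    """Estimate of h_0 - h_infinity; may be negative due to compressor overhead."""
    return plugin_entropy(labels) - compression_entropy_rate(labels)


def select_representation(xs, candidates):
    """Return the name of the candidate function with the largest estimate."""
    return max(
        candidates,
        key=lambda name: time_series_information([candidates[name](x) for x in xs]),
    )


# Toy observations: a slowly switching component plus an i.i.d. noise bit.
rng = random.Random(0)
xs = [((i // 50) % 2, rng.randrange(2)) for i in range(20000)]
candidates = {"block": lambda x: x[0], "noise": lambda x: x[1]}

print(select_representation(xs, candidates))
```

A general-purpose compressor only upper-bounds the entropy rate, so the absolute values are biased; the sketch relies only on the ranking of the candidates.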



The situation is more difficult if the space of representation functions is infinite (possibly uncountable); moreover, we would like to introduce the learner's actions into the process, potentially making the time series $(X_i)_{i \in \mathbb{N}}$ non-stationary.

These scenarios are considered in the following sections. For the control problem, a special role is played by Markov environments; we first look at the simplifications gained by making this assumption in the stationary case.



4 Time-series information for Markov chains



If the $(X_i)_{i \in \mathbb{N}}$ form a stationary ($k$-order) Markov process then the situation simplifies considerably. First of all, if $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(f(X_i))_{i \in \mathbb{N}}$ then $(f(X_i))_{i \in \mathbb{N}}$ also form a stationary ($k$-order) Markov chain. Moreover, to find the function that maximizes the time-series information (5) it is enough to find the function that maximizes a simpler quantity $I_k(f) = I(f(X_0); f(X_1), \dots, f(X_k))$, as the following theorem shows. In the theorem and henceforth, for the sake of simplicity of notation, we only consider the case $k = 1$; the general case is analogous.

Theorem 2. Suppose that $X_i$ form a stationary Markov process and $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(f(X_i))_{i \in \mathbb{N}}$. Then

(i) $(f(X_i))_{i \in \mathbb{N}}$ also form a stationary Markov chain.

(ii) In this case $I_\infty(f)$ is the mutual information between $f(X_0)$ and $f(X_1)$:

$$I_\infty(f) = I_1(f) = I(f(X_0); f(X_1)), \qquad (7)$$

and for any $g : \mathcal{X} \to \mathcal{Y}$ we have $I_1(f) \ge I_1(g)$, with equality if and only if $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(g(X_i))_{i \in \mathbb{N}}$.

Proof. We use the notation $Y_i := f(X_i)$. For the first statement, observe that

$$h(Y_{n+1} \mid Y_1, \dots, Y_n) = h(Y_{n+1} \mid Y_1, X_1, \dots, Y_n, X_n) = h(Y_{n+1} \mid Y_n, X_n) = h(Y_{n+1} \mid Y_n), \qquad (8)$$

where we have used successively conditional independence, the Markov property for $(X_i)_{i \in \mathbb{N}}$ and again conditional independence.

For the second statement, first note that $h_\infty = h_1$ for Markov chains, implying (7). Next, for any $g : \mathcal{X} \to \mathcal{Y}$ the process $g(X_i)$ is stationary, which implies $h_\infty(g(X)) \le h_1(g(X))$. Thus, using Theorem 1 we obtain

$$I_1(f) = I_\infty(f) \ge I_\infty(g) \ge h_0(g) - h_1(g) = I_1(g).$$

□
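Theorem 2 can be checked numerically on a small example. The sketch below is illustrative only and assumes a hypothetical 3-state stationary Markov chain with transition matrix `P`. The identity labelling plays the role of $f$ (it trivially satisfies conditional independence, since $X_n$ is determined by $f(X_n)$), and a labelling `merged` that lumps two states together plays the role of $g$. The function names are hypothetical.

```python
from math import log2


def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def one_step_information(P, g):
    """Exact I(g(X_0); g(X_1)) in bits for the stationary chain P.

    g is a list mapping each state index to its label.
    """
    pi = stationary_distribution(P)
    joint = {}
    for i, row in enumerate(P):
        for j, p in enumerate(row):
            key = (g[i], g[j])
            joint[key] = joint.get(key, 0.0) + pi[i] * p
    first, second = {}, {}
    for (a, b), p in joint.items():
        first[a] = first.get(a, 0.0) + p
        second[b] = second.get(b, 0.0) + p
    return sum(
        p * log2(p / (first[a] * second[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


P = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
identity = [0, 1, 2]  # f: keeps every state distinct
merged = [0, 0, 1]    # g: lumps states 0 and 1 together

print(one_step_information(P, identity))  # about 0.663 bits
print(one_step_information(P, merged))    # smaller, as Theorem 2 predicts
```

In this example the inequality $I_1(f) \ge I_1(g)$ also follows directly from the data-processing inequality; the theorem covers the more general case in which $f$ itself discards part of the observation.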



5 Uniform approximation of $I_k(f)$

Given an infinite (possibly uncountable) set $\mathcal{F}$ of functions $f : \mathcal{X} \to \mathcal{Y}$ we want to find a function that maximizes $I_\infty(f)$. From the previous section we know that in the case of $k$-order Markov chains $I_k(f) = I_\infty(f)$; otherwise, the former can still be considered an approximation for the latter since $I_k(f) \to I_\infty(f)$. Here we consider the problem of approximating $I_k(f)$, and leave the general case for future work. Note that for the results of this section it is not required that $X_i$ form a ($k$-order) Markov chain.

Since we do not know $I_k(f)$, we can select a function that maximizes the empirical estimate $\hat{I}_k(f)$. The question arises: under what conditions is this procedure consistent? The conditions needed are of the following two types: first, the set $\mathcal{F}$ should be sufficiently small, and, second, the time series $(X_i)_{i \in \mathbb{N}}$ should be such that uniform (over $\mathcal{F}$) convergence guarantees can be established. Here the first condition is formalized in terms of VC dimension, and the second in terms of mixing times. We show that, under these conditions, the empirical estimator is indeed consistent and learning-theory-style finite-sample performance guarantees can be established.

As before, for the sake of simplicity we consider the 
case k = 1; the general case is analogous. 

For a function $f : \mathcal{X} \to \mathcal{Y}$ and a sample $X_1, \dots, X_n$ define the following estimators: $\hat{p}_f(y) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{f(X_i) = y\}$, and analogously for $\hat{p}_f(y_1, \dots, y_k)$ and multivariate entropies.



Definition 3 ($\beta$-mixing coefficients). For a process distribution $\rho$ define the mixing coefficients

$$\beta(\rho, k) := \sup_{A \in \sigma(X_{-\infty..0}),\; B \in \sigma(X_{k..\infty})} |\rho(A \cap B) - \rho(A)\rho(B)|,$$

where $\sigma(..)$ denotes the sigma-algebra of the random variables in brackets.

When $\beta(\rho, k) \to 0$ the process $\rho$ is called absolutely regular; this condition is much stronger than ergodicity, but is much weaker than the i.i.d. assumption.

The general tool that we use to obtain performance guarantees in this section is the following bound that can be obtained from the results of [7]. Let $\mathcal{F}$ be a set of VC dimension $d$ and let $\rho$ be a stationary distribution. Then

$$q_n(\rho, \mathcal{F}, \varepsilon) := \rho\Big( \sup_{g \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^{n} g(X_i) - \mathbb{E}_\rho g(X_1) \Big| > \varepsilon \Big) \le n \beta(\rho, t_n) + 8 t_n^{d+1} e^{-l_n \varepsilon^2 / 8}, \qquad (9)$$

where $t_n$ is an integer in $1..n$ and $l_n = n / t_n$. The parameters $t_n$ should be set according to the values of $\beta$ in order to optimize the bound.

Furthermore, assume geometric $\beta$-mixing distributions, that is, $\beta(\rho, t) \le \gamma^t$ for some $\gamma < 1$. Letting $l_n = t_n = \sqrt{n}$ the bound (9) becomes

$$q_n(\rho, \mathcal{F}, \varepsilon) \le n \gamma^{\sqrt{n}} + 8 n^{(d+1)/2} e^{-\sqrt{n}\, \varepsilon^2 / 8} =: \Delta(d, \varepsilon, n, \gamma). \qquad (10)$$

Geometric $\beta$-mixing properties can be demonstrated for large classes of ($k$-order) (PO)MDPs [5], and for many other distributions.

Theorem 3. Let the time series $(X_i)_{i \in \mathbb{N}}$ be generated by a stationary distribution $\rho$ whose $\beta$-mixing coefficients satisfy $\beta(\rho, t) \le \gamma^t$ for some $\gamma < 1$. Let $\mathcal{F}$ be a set of functions $f : \mathcal{X} \to \mathcal{Y}$ such that for each $y \in \mathcal{Y}$ the VC dimension of the set $\{ \mathbb{I}_{\{x \in \mathcal{X} : g(x) = y\}} : g \in \mathcal{F} \}$ is not greater than $d$. Furthermore, assume that there exists an $\alpha > 0$ such that for any $g \in \mathcal{F}$ and any $y_1, y_2 \in \mathcal{Y}$ we have

$$P(g(X_0) = y_1, g(X_1) = y_2) \ge \alpha. \qquad (11)$$

Then

$$P\Big( \sup_{g \in \mathcal{F}} |\hat{I}_1(g) - I_1(g)| > \varepsilon \Big) \le 4 |\mathcal{Y}|^2 \, \Delta\Big(2d, \frac{-\varepsilon}{|\mathcal{Y}|^2 \log \alpha}, n - 1, \gamma\Big) \qquad (12)$$

for every $\varepsilon < \alpha$.

The proof is deferred to Section 7.
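To get a feeling for the sample sizes at which a bound of this type becomes informative, one can evaluate it numerically. The sketch below assumes that the geometric-mixing bound (10) reads $n\gamma^{\sqrt{n}} + 8 n^{(d+1)/2} e^{-\sqrt{n}\varepsilon^2/8}$; the constants are taken from our reading of the formula and should be checked against [7] before being relied upon. The function name `delta_bound` and the parameter values are hypothetical.

```python
from math import exp, sqrt


def delta_bound(d, eps, n, gamma):
    """Evaluate n * gamma^sqrt(n) + 8 * n^((d+1)/2) * exp(-sqrt(n) * eps^2 / 8).

    Assumed form of the bound (10); values above 1 are vacuous as probabilities.
    """
    root = sqrt(n)
    mixing_term = n * gamma ** root
    vc_term = 8 * n ** ((d + 1) / 2) * exp(-root * eps ** 2 / 8)
    return mixing_term + vc_term


# Illustrative parameters: VC dimension 1, accuracy 0.5, mixing rate 0.5.
for n in (10**2, 10**4, 10**6):
    print(n, delta_bound(1, 0.5, n, 0.5))
```

With these illustrative parameters the bound is vacuous for a hundred samples and becomes very small around a million samples, which reflects the $\sqrt{n}$ effective sample size induced by the blocking argument.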

6 The active case: MDPs 

In this section we introduce the learner's actions into the protocol. The setting is a sequential interaction between the learner and the environment. Given are a space of observations $\mathcal{X}$ and a space of actions $\mathcal{A}$, where $\mathcal{A}$ is assumed finite. At each time step $i \in \mathbb{N}$ the environment provides an observation $X_i$, the learner takes an action $A_i$, then the next observation $X_{i+1}$ is provided, and so on. Each next observation $X_{i+1}$ is generated according to some (unknown) probability distribution $P(X_{i+1} \mid X_0, A_0, \dots, X_i, A_i)$. Actions are generated by a probability distribution $\pi$ that is called a policy; in general, it has the form $\pi(A_{i+1} \mid X_0, A_0, \dots, X_i, A_i, X_{i+1})$. Note that we do not introduce costs or rewards into consideration. Thus, we are dealing with an unsupervised version of the problem; the goal is just to find a concise representation that preserves the dynamics of the problem.

Definition 4 (conditional independence, active case). For a policy $\pi$, an environment $P$ and a measurable function $f$ we say that $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(f(X_i))_{i \in \mathbb{N}}$ under the policy $\pi$ if

$$P^\pi(X_n \mid f(X_n), A_n, X_{i_1}, A_{i_1}, \dots, X_{i_k}, A_{i_k}) = P^\pi(X_n \mid f(X_n)) \quad \text{a.s.} \qquad (13)$$

for all $n, k \in \mathbb{N}$ and all $i_1, \dots, i_k \in \mathbb{N}$ such that $i_j \ne n$, $j = 1..k$, where $P^\pi$ refers to the joint distribution of $X_i$ and $A_i$ generated according to $P$ and $\pi$.

The focus in this section is on time-homogeneous Markov environments, that is, on Markov Decision Processes (MDPs). Thus, we assume that $X_{i+1}$ only depends on $X_i$ and $A_i$, that is, $P$ can be identified with a function from $\mathcal{X} \times \mathcal{A}$ to the space $\mathcal{P}(\mathcal{X})$ of probability distributions on $\mathcal{X}$:

$$P(X_{i+1} \mid X_0, A_0, \dots, X_{i-1}, A_{i-1}, X_i = x, A_i = a) = P(X_{i+1} \mid x, a).$$

In this case observations $X_i$ are called states.

A policy is called stationary if each action only depends on the current state; that is, $\pi(A_{i+1} \mid X_0, A_0, \dots, X_i, A_i, X_{i+1} = x) = \pi(A \mid x)$, where for each $x \in \mathcal{X}$, $\pi(A \mid x)$ is a distribution over $\mathcal{A}$.

Call an MDP admissible if any stationary policy $\pi$ has a (unique up to sets of measure 0) stationary distribution $P^\pi$ over states. The notation $\mathbb{E}^\pi$, $P^\pi$, $h^\pi$, $I^\pi_\infty$, etc. refers to the stationary distribution of the policy $\pi$.

For MDPs we introduce the following policy- 
independent definition of conditional independence. 

Definition 5 (conditional independence, MDPs). For an admissible MDP and a measurable function $f : \mathcal{X} \to \mathcal{Y}$ we say that $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(f(X_i))_{i \in \mathbb{N}}$ if for every stationary policy $\pi$, $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $f(X_i)$ under policy $\pi$.

Call a stationary policy $\pi$ stochastic if $\pi(a \mid x) \ge \alpha > 0$ for every $x \in \mathcal{X}$ and every $a \in \mathcal{A}$.

Call an admissible MDP (weakly) connected if there exists a stationary policy $\pi$ such that (equivalently: for every stochastic policy $\pi$) for any other stationary policy $\pi'$ we have $P^\pi \gg P^{\pi'}$ (that is, for any measurable $S \subset \mathcal{X} \times \mathcal{A}$, $P^{\pi'}(S) > 0$ implies $P^\pi(S) > 0$).

For discrete MDPs this definition coincides with the usual definition of weak connectedness (for any pair of states $s_1, s_2$ there is a policy that gets from $s_1$ to $s_2$ in a finite number of steps with non-zero probability).
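For a finite MDP the discrete condition can be checked by a graph search. The following is a minimal sketch under the assumption that the MDP is given as a nested dictionary `transitions[state][action]` mapping next states to probabilities (a hypothetical encoding, not one used in this paper): a state $s_2$ is reachable from $s_1$ under some policy exactly when there is a path of positive-probability transitions, so the check reduces to all-pairs reachability.

```python
from collections import deque


def reachable(transitions, start):
    """States reachable from start via some sequence of positive-probability moves."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for next_dist in transitions[state].values():
            for next_state, prob in next_dist.items():
                if prob > 0 and next_state not in seen:
                    seen.add(next_state)
                    queue.append(next_state)
    return seen


def is_connected(transitions):
    """True if every state can be reached from every other state."""
    states = set(transitions)
    return all(reachable(transitions, s) >= states for s in states)


# Two hypothetical toy MDPs: one connected, one with an absorbing state.
connected_mdp = {
    0: {"stay": {0: 1.0}, "go": {1: 1.0}},
    1: {"stay": {1: 1.0}, "go": {0: 1.0}},
}
absorbing_mdp = {
    0: {"go": {1: 1.0}},
    1: {"stay": {1: 1.0}},
}

print(is_connected(connected_mdp))  # True
print(is_connected(absorbing_mdp))  # False
```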

Theorem 4. Fix an admissible weakly connected MDP and a stationary stochastic policy $\pi$. Then $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(f(X_i))_{i \in \mathbb{N}}$ if and only if $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(f(X_i))_{i \in \mathbb{N}}$ under $\pi$.

Proof. We only have to prove the "if" part (the other part is obvious). Let $\pi_0$ be any stationary policy. Introduce the notation $Y_i := f(X_i)$ and

$$U_0 := (X_{-1}, A_{-1}, A_0, X_1, A_1).$$

We have to establish (13) for $P^{\pi_0}$; note that since the process is Markov we can take $k = 2$, $i_1 = 1$, $i_2 = -1$ in (13) w.l.o.g.; thus, we need to demonstrate

$$P^{\pi_0}(X_0 \mid Y_0) = P^{\pi_0}(X_0 \mid Y_0, U_0) \quad \text{a.s.} \qquad (14)$$

Since the policy $\pi$ is stochastic, the measure $P^\pi$ dominates $P^{\pi_0}$. Therefore, the following probability-one statements are non-vacuous:

$$P^\pi(X_0 \mid Y_0) = P^\pi(X_0 \mid Y_0, U_0) = P^{\pi_0}(X_0 \mid Y_0, U_0) \quad \text{a.s.},$$

where the first equality follows from (13) and the second follows from the fact that conditionally on the actions the distributions $P^\pi$ and $P^{\pi_0}$ coincide. Moreover,

$$P^{\pi_0}(X_0 \mid Y_0) = \mathbb{E}^{\pi_0}_{U_0} P^{\pi_0}(X_0 \mid Y_0, U_0) = \mathbb{E}^{\pi_0}_{U_0} P^\pi(X_0 \mid Y_0) = P^\pi(X_0 \mid Y_0) \quad \text{a.s.}$$

Together, the last two displays give (14). Thus, $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(f(X_i))_{i \in \mathbb{N}}$ under $\pi_0$; since $\pi_0$ was chosen arbitrarily, this concludes the proof. □

Corollary 1. Fix an admissible MDP and a stationary stochastic policy $\pi$. Assume that for some $f : \mathcal{X} \to \mathcal{Y}$, $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(f(X_i))_{i \in \mathbb{N}}$. If $f' = \arg\max_g I_1^\pi(g)$ then $(X_i)_{i \in \mathbb{N}}$ are conditionally independent given $(f'(X_i))_{i \in \mathbb{N}}$.

Proof. The statement follows from Theorems 2 and 4. □

Consider the following scenario. A real-life control problem is given, in which an (average, discounted) cost has to be optimized. In addition, a simulator for this problem is available; running the simulator does not incur any costs, but also does not provide any information about the costs: it only simulates the dynamics of the problem. Given such a simulator, and a set $\mathcal{F}$ of representation functions, one can first execute a random policy to find the best representation function $f$ as the one that maximizes $I_1(f)$. Under the conditions given in Section 5 the resulting estimator is consistent. One can then use the representation function found to learn the optimal policy in the real problem (with costs).
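The scenario can be illustrated on a toy simulator. Everything in the sketch below is a hypothetical assumption made for illustration: the simulator emits observations $x = (z, \text{noise})$, where a latent bit $z$ evolves according to the chosen action and the noise bit is i.i.d.; the policy picks actions uniformly at random; and the two candidate representations keep either the latent bit or the noise bit. The plug-in estimate of $I_1$ uses empirical pair frequencies.

```python
import random
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy (bits) of an empirical distribution given as counts."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def empirical_I1(labels):
    """Plug-in estimate of I(f(X_0); f(X_1)) from consecutive label pairs."""
    pairs = list(zip(labels, labels[1:]))
    h_first = entropy(Counter(a for a, _ in pairs))
    h_second = entropy(Counter(b for _, b in pairs))
    return h_first + h_second - entropy(Counter(pairs))


def simulate(n_steps, rng):
    """Toy cost-free simulator run under a uniformly random policy."""
    stay_prob = {0: 0.9, 1: 0.7}  # probability that z is unchanged, per action
    z = 0
    xs = []
    for _ in range(n_steps):
        xs.append((z, rng.randrange(2)))  # observation = (latent bit, noise bit)
        action = rng.randrange(2)         # random policy, no rewards used
        if rng.random() > stay_prob[action]:
            z = 1 - z
    return xs


candidates = {"latent": lambda x: x[0], "noise": lambda x: x[1]}
xs = simulate(20000, random.Random(0))
scores = {name: empirical_I1([f(x) for x in xs]) for name, f in candidates.items()}
best = max(scores, key=scores.get)

print(scores)
print(best)
```

The representation that keeps the latent bit scores clearly higher, since the noise bit carries essentially no information about its successor. The representation selected this way would then be handed to a standard control algorithm operating on the aggregated states.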

The problem of solving (efficiently) both problems together, that is, learning the representation and finding the optimal policy in a control problem, is left for future work.



7 Longer proofs 

Proof of Theorem 1. First note that from the definition (4) of conditional independence and using the chain rule for entropies, it is easy to show that for any $n, k, i_1, \dots, i_k \in \mathbb{N}$ and for any (measurable) function $f' : \mathcal{X} \to \mathcal{Y}$ we have

$$h(f(X_n) \mid f(X_{i_1}), f'(X_{i_1}), \dots, f(X_{i_k}), f'(X_{i_k})) = h(f(X_n) \mid f(X_{i_1}), \dots, f(X_{i_k})), \qquad (15)$$

so that

$$h(f(X_n) \mid f(X_{i_1}), \dots, f(X_{i_k})) \le h(f(X_n) \mid f'(X_{i_1}), \dots, f'(X_{i_k})). \qquad (16)$$

Consider the following entropies and information (with straightforward definitions): $h_0(f, g)$, $h_k(f, g)$, $I_k(f, g)$ and $I_\infty(f, g)$. We will first show that

$$I_k(f, g) = I_k(f) \quad \text{and} \quad I_\infty(f, g) = I_\infty(f). \qquad (17)$$

The latter equality follows from the former and the definition of $h_\infty$. To prove the former we will consider the case $k = 1$; the general case is analogous. Introduce the short-hand notation $Y_i := f(X_i)$, $Z_i := g(X_i)$, $i \in \mathbb{N}$. First note that

$$h_0(f, g) = h(Y_0) + h(Z_0 \mid Y_0). \qquad (18)$$

Moreover,

$$h_1(f, g) = h(Y_0, Z_0 \mid Y_{-1}, Z_{-1}) = h(Y_0 \mid Y_{-1}, Z_{-1}) + h(Z_0 \mid Y_0, Y_{-1}, Z_{-1}) = h(Y_0 \mid Y_{-1}) + h(Z_0 \mid Y_0), \qquad (19)$$

where the first equality is by definition, the second is the chain rule for entropy and the third follows from (15) and conditional independence of $X_i$ given $f(X_i)$. Thus, from (18), (19) and the definition of $I_1(f)$ we get

$$I_1(f, g) = h_0(f, g) - h_1(f, g) = h(Y_0) + h(Z_0 \mid Y_0) - h(Y_0 \mid Y_{-1}) - h(Z_0 \mid Y_0) = I_1(f),$$

finishing the proof of (17).

To prove the theorem it remains to show that, if $(X_i)_{i \in \mathbb{N}}$ are not conditionally independent given $(g(X_i))_{i \in \mathbb{N}}$, then $I_\infty(f) > I_\infty(g)$. For that it is enough to show that

$$I_k(f) > I_k(g) \qquad (20)$$

from some $k$ on. Assume that $(X_i)_{i \in \mathbb{N}}$ are not conditionally independent given $(g(X_i))_{i \in \mathbb{N}}$, so that

$$P(X_n \mid g(X_n), X_{i_1}, \dots, X_{i_k}) \ne P(X_n \mid g(X_n)) \qquad (21)$$

for some $k, n$ and $i_1, \dots, i_k \ne n$. By stationarity, we obtain from (21) that there exists $k \in \mathbb{N}$ such that

$$P(X_0 \mid g(X_0), X_{-k}, \dots, X_{-1}, X_1, \dots, X_k) \ne P(X_0 \mid g(X_0)). \qquad (22)$$

We will show that (20) holds for all $k$ for which (22) holds. Clearly, if (22) holds for $k \in \mathbb{N}$ then it also holds for all $k' > k$. Thus, it is enough to consider the case $k = 1$; the general case is analogous. With this simplification in mind, and using our $Y$ and $Z$ notation, note that (22) implies that

$$P(Y_0 \mid Z_0, X_1, X_{-1}) \ne P(Y_0 \mid Z_0), \qquad (23)$$

for otherwise we would get $P(X_0 \mid Y_0, Z_0, X_1, X_{-1}) \ne P(X_0 \mid Y_0, Z_0)$, contradicting conditional independence of $X$ given $f(X)$. Finally, from (23) and (15) we get

$$P(Y_0 \mid Z_0, Y_1, Y_{-1}) \ne P(Y_0 \mid Z_0),$$

so that

$$h(Y_0 \mid Z_0) - h(Y_0 \mid Z_0, Y_1, Y_{-1}) > 0. \qquad (24)$$

We will show that (24) implies that at least one of the following two inequalities holds:

$$h(Y_1 \mid Z_0, Y_{-1}) > h(Y_1 \mid Y_0, Y_{-1}), \qquad (25)$$

$$h(Y_1 \mid Z_0) > h(Y_1 \mid Y_0). \qquad (26)$$

Indeed, if both (25) and (26) are false then from (15) and (16) we obtain

$$h(Y_1 \mid Z_0, Y_{-1}) = h(Y_1 \mid Y_0, Y_{-1}) = h(Y_1 \mid Y_0, Z_0, Y_{-1}), \qquad (27)$$

and

$$h(Y_1 \mid Z_0) = h(Y_1 \mid Y_0) = h(Y_1 \mid Y_0, Z_0). \qquad (28)$$

Thus, using the decomposition for conditional entropy and (27), we derive

$$h(Y_0 \mid Z_0, Y_1, Y_{-1}) = h(Y_0, Z_0, Y_1, Y_{-1}) - h(Z_0, Y_1, Y_{-1})$$
$$= h(Y_1 \mid Z_0, Y_0, Y_{-1}) + h(Z_0, Y_0, Y_{-1}) - h(Y_1 \mid Z_0, Y_{-1}) - h(Z_0, Y_{-1})$$
$$= h(Z_0, Y_0, Y_{-1}) - h(Z_0, Y_{-1}) = h(Y_0 \mid Z_0, Y_{-1}). \qquad (29)$$

Continuing in the same way but using (28) we obtain

$$h(Y_0 \mid Z_0, Y_1, Y_{-1}) = h(Y_0 \mid Z_0),$$

contradicting (24). Thus, either (25) or (26) holds true; consider the former inequality (the latter one is analogous). We have

$$I_2(f) = h(Y_1) - h(Y_1 \mid Y_0, Y_{-1}) > h(Y_1) - h(Y_1 \mid Z_0, Y_{-1}) \ge h(Y_1) - h(Y_1 \mid Z_0, Z_{-1})$$
$$= I(Y_1; Z_0, Z_{-1}) = I(Z_0, Z_{-1}; Y_1) \ge I(Z_0, Z_{-1}; Z_1) = I(Z_1; Z_0, Z_{-1}) = I_2(g),$$

where we have used the definition of $I_k$, (25), (16), the definition of mutual information and the symmetry thereof. This demonstrates (20) and concludes the proof. □

Proof of Theorem 3. First, note that from stationarity of $(X_i)_{i \in \mathbb{N}}$ we get

$$I_1(g) = 2 h(g(X_0)) - h(g(X_0), g(X_1)),$$

so that

$$\sup_{g \in \mathcal{F}} |\hat{I}_1(g) - I_1(g)| \le \sup_{g \in \mathcal{F}} 2 |\hat{h}(g(X_0)) - h(g(X_0))| + \sup_{g \in \mathcal{F}} |\hat{h}(g(X_0), g(X_1)) - h(g(X_0), g(X_1))| =: T_1 + T_2.$$

Introduce the shorthand notation $p_g(y) := P(g(X_0) = y)$.



From the conditions of the theorem we know that $p_g(y) \ge \alpha$ for any $g, y$; but we will also need the same to hold for the estimates $\hat{p}$. So, consider the following event

$$B := \Big\{ \inf_{g \in \mathcal{F},\, y \in \mathcal{Y}} \hat{p}_g(y) \le \alpha / 2 \Big\},$$

and the following simple decomposition

$$P(T_1 > \varepsilon) \le P(B) + P(T_1 > \varepsilon \mid \neg B). \qquad (30)$$

From (11) and the bound (10) we obtain

$$P(B) \le |\mathcal{Y}| \, \Delta(d, \alpha/2, n, \gamma). \qquad (31)$$

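The Taylor expansion of a function $u$ differentiable around $t$ can be expressed as $u(t) = u(c) + (t - c) u'(\theta c + (1 - \theta) t)$ for some $\theta \in (0, 1)$. Using this for the function $u(p) = p \log p$ we obtain

$$|\hat{h}_0(g) - h_0(g)| = \Big| \sum_{y \in \mathcal{Y}} \big( \hat{p}_g(y) \log \hat{p}_g(y) - p_g(y) \log p_g(y) \big) \Big|$$
$$= \Big| \sum_{y \in \mathcal{Y}} (\hat{p}_g(y) - p_g(y)) \big( 1 + \log(\theta \hat{p}_g(y) + (1 - \theta) p_g(y)) \big) \Big|$$
$$\le -\log \alpha \sum_{y \in \mathcal{Y}} |\hat{p}_g(y) - p_g(y)|$$
$$= -\log \alpha \sum_{y \in \mathcal{Y}} \Big| \mathbb{E}\, \mathbb{I}_{g(X_0) = y} - \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{g(X_i) = y} \Big|,$$

where the last inequality uses (11) and holds under the assumption $\hat{p}_g(y) \ge \alpha / 2$ for all $y \in \mathcal{Y}$ (that is, conditional on $\neg B$). From this, (31), (30) and (10) we obtain

$$P(T_1 > \varepsilon) \le |\mathcal{Y}| \Big( \Delta(d, \alpha/2, n, \gamma) + \Delta\Big(d, \frac{-\varepsilon}{|\mathcal{Y}| \log \alpha}, n, \gamma\Big) \Big) \le 2 |\mathcal{Y}| \, \Delta\Big(d, \frac{-\varepsilon}{|\mathcal{Y}| \log \alpha}, n, \gamma\Big),$$

where we have used $\varepsilon < \alpha$.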

It remains to repeat the same analysis for $T_2$. The difference is that instead of the entropy of one variable $h(g(X_0))$ we have to deal with the entropies of pairs $h(g(X_0), g(X_1))$. First, observe that, from the definition of mixing, if a process $\rho$ generating $X_0, X_1, X_2, \dots$ is mixing with coefficients $\beta(\rho, k)$ then the process made of pairs $(X_0, X_1), (X_1, X_2), \dots$ is mixing with coefficients $\beta(\rho, k - 1)$. Next, for the VC dimensions, observe that if a set

$$\{ \{x : g(x) = y\} : g \in \mathcal{F} \}$$

has VC dimension $d$ (for every $y \in \mathcal{Y}$) then the set of pairs

$$\{ (\{x : g(x) = y\}, \{x : g'(x) = y'\}) : g, g' \in \mathcal{F} \}$$

has VC dimension bounded by $2d$ (for all $y, y' \in \mathcal{Y}$); see [19], which also gives a more precise bound. Now we can repeat the derivation for $T_2$, and obtain the resulting bound (12). □